Voice verification circuit for validating the identity of an unknown person

ABSTRACT

A speaker verification system receives input speech from a speaker of unknown identity. The speech undergoes linear predictive coding (LPC) analysis and transformation to maximize separability between true speakers and impostors when compared to reference speech parameters which have been similarly transformed. The transformation incorporates an "inter-class" covariance matrix of successful impostors within a database.

TECHNICAL FIELD OF THE INVENTION

This invention relates in general to speech analysis, and more particularly to a high performance speaker verification system including speaker discrimination.

BACKGROUND OF THE INVENTION

In many applications, it is necessary to verify the identity of an unknown person. One example of an identity verification device is a photo badge by which an interested party may compare the photo on the badge with the person claiming an identity in order to verify the claim. This method of verification has many shortcomings. Badges are prone to loss and theft, and are relatively easy to duplicate or adulterate. Furthermore, the inspection of the badge must be performed by a person, and is thus not applicable to many situations where the verification must be done by a machine. In short, an effective verification system or device must be cost-effective, fast, accurate, easy to use and resistant to tampering or impersonation.

Long distance credit card services, for example, must identify a user to ensure that an impostor does not use the service under another person's identity. Prior art systems provide a lengthy identification number (calling card number) which must be entered via the phone's keypad to initiate the long distance service. This approach is prone to abuse, since the identification number may be easily appropriated by theft, or by simply observing the entry of the identification number by another. It has been estimated that the loss to the long distance services due to unauthorized use exceeds $500,000,000 per year.

Speaker verification systems have been available for several years. However, most applications require a very small true speaker rejection rate, and a small impostor acceptance rate. If the true speaker rejection rate is too high, then the verification system will place a burden on the users. If the impostor acceptance rate is too high, then the verification system may not be of value. Prior art speaker verification systems have not provided the necessary discrimination between true speakers and impostors to be commercially acceptable in applications where the speaking environment is unfavorable.

Speaker verification over long distance telephone networks presents challenges not previously overcome. Variations in handset microphones result in severe mismatches between speech data collected from different handsets for the same speaker. Further, the telephone channels introduce signal distortions which reduce the accuracy of the speaker verification system. Also, there is little control over the speaker or speaking conditions.

Therefore, a need has arisen in the industry for a system to prevent calling card abuse over telephone lines. Further, a need has arisen to provide a speaker verification system which effectively discriminates between true speakers and impostors, particularly in a setting where verification occurs over a long distance network.

SUMMARY OF THE INVENTION

In accordance with the present invention, a speaker verification method and apparatus is provided which substantially reduces the problems associated with prior verification systems.

A telephone long distance service is provided using speaker verification to determine whether a user is a valid user or an impostor. The user claims an identity by offering some form of identification, typically by entering a calling card number on the phone's touch-tone keypad. The service requests the user to speak a speech sample, which is subsequently transformed and compared with a reference model previously created from a speech sample provided by the valid user. The comparison results in a score which is used to accept or reject the user.

The telephone service verification system of the present invention provides significant advantages over the prior art. In order to be accepted, an impostor would need to know the correct phrase, the proper inflection and cadence in repeating the phrase, and would have to have speech features sufficiently close to the true speaker. Hence, the likelihood of defeating the system is very small.

The speaker verification system of the present invention receives input speech from a speaker of unknown identity. The speech signal is subjected to an LPC analysis to derive a set of spectral and energy parameters based on the speech signal energy and spectral content of the speech signal. These parameters are transformed to derive a template of statistically optimum features that are designed to maximize separability between true speakers and known impostors. The template is compared with a previously stored reference model for the true speaker. A score is derived from a comparison with the reference model which may be compared to a threshold to determine whether the unknown speaker is a true speaker or an impostor. The comparison of the input speech to the reference speech is made using Euclidean distance measurements (the sum of the squares of distances between corresponding features) using either Dynamic Time Warping or Hidden Markov Modeling.

In one aspect of the present invention, the transformation is computed using two matrices. The first matrix is a matrix derived from the speech of all true speakers in a database. The second matrix is a matrix derived from all successful impostors in the database, that is, those whose speech may be confused with that of a true speaker. The second matrix provides a basis for discriminating between true speakers and known impostors, thereby increasing the separability of the system.

The speaker verification system of the present invention provides significant advantages over prior art systems. First, the percentage of impostor acceptance relative to the percentage of true speaker rejection is decreased. Second, the dimensionality of the transformation matrix may be reduced, thereby reducing template storage requirements and computational burden.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates a flow chart depicting a personal identity verification system for long-distance calling card services using speech verification;

FIG. 2 illustrates a block diagram depicting enrollment of a speaker into the speaker verification system of the present invention;

FIG. 3 illustrates a block diagram depicting the verification and reference update used in the present invention;

FIG. 4 illustrates a block diagram depicting the verification system used in the present invention;

FIG. 5a illustrates the vectors used to form the in-class and inter-class matrices used to form the transformation matrix in the present invention;

FIG. 5b illustrates a block diagram depicting the formation of the speaker discrimination transformation matrix; and

FIG. 6 illustrates a comparison of the speaker verification system of the present invention with a prior art speaker verification system.

DETAILED DESCRIPTION OF THE INVENTION

The preferred embodiment of the present invention is best understood by referring to FIGS. 1-6 of the drawings.

FIG. 1 illustrates a flow chart 10 depicting personal identity verification using speaker verification in connection with a long-distance calling card service. In block 12, a person claims an identity by offering some information corresponding to a unique identification. For example, a long distance telephone subscriber may enter a unique ID number to claim his identity. In other applications, such as entry to a building, a person may claim identity by presenting a picture badge.

Since the identification offered in block 12 is subject to theft and/or alteration, the personal identity verification system of the present invention requests a voice sample from the person in block 14. In block 16, the voice sample provided by the person is compared to a stored reference voice sample which has been previously obtained for the speaker whose identity is being claimed (the "true" speaker). Supplemental security is necessary to ensure that unauthorized users do not create a reference model for another valid user. If the voice sample correlates with the stored voice sample according to predefined decision criteria in decision block 18, the identity offered by the person is accepted in block 20. If the match between the reference voice sample and the input speech utterance does not satisfy the decision criteria in decision block 18, then the offered identity is rejected in block 22.

FIG. 2 illustrates a block diagram depicting enrollment of a user's voice into the speaker verification system of the present invention. During the enrollment phase, each user of the system supplies a voice sample comprising an authorization phrase which the user will use to gain access to the system. The enrollment speech sample is digitized using an analog-to-digital (A/D) converter 24. The digitized speech is subjected to a linear predictive coding (LPC) analysis in circuit 26. The beginning and end of the enrollment speech sample are detected by the utterance detection circuit 28. The utterance detection circuit 28 estimates a speech utterance level parameter from RMS energy (computed for every 40 msec frame) using fast upward adaptation and slow downward adaptation. The utterance detection threshold is determined from a noise level estimate and a predetermined minimum speech utterance level. The end of the utterance is declared when the speech level estimate remains below a fraction (for example 0.125) of the peak speech utterance level for a specified duration (for example 500 msec). Typically, the utterance has a duration of 2-3 seconds.
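By way of illustration only, the utterance detection described above may be sketched in Python as follows. The adaptation rates `up` and `down`, the rule combining the noise level estimate with the minimum speech utterance level, and the function names are assumptions made for this sketch and are not taken from the specification; only the 0.125 fraction and the approximately 500 msec hold (thirteen 40 msec frames) follow the examples given above.

```python
import math

def frame_rms(samples, frame_len):
    """RMS energy of each complete, non-overlapping frame of `samples`."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def detect_utterance(rms, noise_level, min_speech_level,
                     up=0.5, down=0.1, end_fraction=0.125, hold_frames=13):
    """Return (start, end) frame indices (end exclusive), or None if no speech."""
    # Assumption: the threshold is the larger of a noise multiple and the
    # predetermined minimum speech level; the text does not give the exact rule.
    threshold = max(2.0 * noise_level, min_speech_level)
    level = peak = 0.0
    start = None
    quiet = 0
    for i, energy in enumerate(rms):
        rate = up if energy > level else down   # fast rise, slow decay
        level += rate * (energy - level)
        if start is None:
            if level >= threshold:
                start, peak = i, level
            continue
        peak = max(peak, level)
        if level < end_fraction * peak:
            quiet += 1
            if quiet >= hold_frames:            # about 500 msec of 40 msec frames
                return start, i - quiet + 1
        else:
            quiet = 0
    return None if start is None else (start, len(rms))
```

Because the level estimate decays slowly, the detected end of the utterance trails the last energetic frame; a practical system would tune the rates to the channel.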

The feature extraction circuit 30 computes a plurality of parameters from each frame of LPC data. In the preferred embodiment, thirty-two parameters are computed by the feature extraction circuit 30, including:

a speech level estimate;

RMS frame energy;

a scalar measure of rate of spectral change;

fourteen filter-bank magnitudes using MEL-spaced simulated filter banks normalized by frame energy;

time difference of frame energy over 40 msec; and

time difference of fourteen filter-bank magnitudes over 40 msec.

The feature extraction circuit 30 computes the thirty-two parameters and derives fourteen features (the least significant features are discarded) using a linear transformation of the LPC data for each frame. The formation of the linear transformation matrix is described in connection with FIGS. 5a and 5b. The fourteen features computed by the feature extraction circuit 30 for each 40 msec frame are stored in a reference template memory 32.
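As a minimal sketch, assuming NumPy and per-frame parameter arrays that have already been computed, the thirty-two parameters may be assembled and reduced to fourteen features as shown below. The function names are hypothetical, and the time differences are assumed here to be differences between consecutive 40 msec frames (with a zero difference for the first frame); the specification does not state how the first frame is handled.

```python
import numpy as np

N_FEATURES = 14  # features retained per frame in the preferred embodiment

def assemble_parameters(level, energy, spectral_change, filter_bank):
    """Stack per-frame parameters into an (n_frames, 32) array."""
    level = np.asarray(level, dtype=float)
    energy = np.asarray(energy, dtype=float)
    spectral_change = np.asarray(spectral_change, dtype=float)
    fb = np.asarray(filter_bank, dtype=float)          # shape (n_frames, 14)
    # Assumed: time difference = change since the previous frame.
    d_energy = np.diff(energy, prepend=energy[:1])
    d_fb = np.diff(fb, axis=0, prepend=fb[:1])
    return np.column_stack([level, energy, spectral_change, fb, d_energy, d_fb])

def extract_features(parameters, transform):
    """Project 32 parameters onto the first 14 columns of a 32-column transform."""
    return parameters @ transform[:, :N_FEATURES]
```

Here `transform` is assumed to hold one column per feature, ranked from most to least significant, so that discarding the least significant features amounts to dropping trailing columns.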

FIG. 3 illustrates a block diagram depicting the verification circuit. The person desiring access must repeat the authorization phrase into the speech verification system. Many impostors will be rejected because they do not know the correct authorization phrase. The input speech (hereinafter "verification speech") is input to a process and verification circuit 34 which determines whether the verification speech matches the speech submitted during enrollment. If the speech is accepted by decision logic 36, then the reference template is updated in circuit 38. If the verification speech is rejected, then the person is requested to repeat the phrase. If the verification speech is rejected after a predetermined number of repeated attempts, the user is denied access.
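The accept, retry and deny behavior described above may be sketched as follows. This is an illustrative outline only: `get_utterance` and `score_fn` are hypothetical callables standing in for the speech input and the process and verification circuit 34, the score is assumed to be a distance (lower is better), and the limit of three attempts is an assumed value for the predetermined number.

```python
def verify_with_retries(get_utterance, score_fn, threshold, max_attempts=3):
    """Return (accepted, utterance, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        utterance = get_utterance()              # prompt the person to speak
        if score_fn(utterance) <= threshold:     # assumed: low score = good match
            return True, utterance, attempt      # caller may now update the reference
        # otherwise the person is asked to repeat the phrase
    return False, None, max_attempts             # access denied
```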

After each successful verification, the reference template is updated by averaging the reference and the most recent utterance (in the feature domain) as follows:

    R_{new} = (1 - a) R_{old} + a T

where

a = min(max(1/n, 0.05), 0.2),

n = session index,

R = reference template data, and

T = last accepted utterance.
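The update rule above may be illustrated with the short sketch below, which assumes that the accepted utterance has already been time aligned to the reference so that both hold the same number of frames and features; the function names are hypothetical. Under the stated formula the weight a equals 0.2 for the first five sessions, falls as 1/n between sessions five and twenty, and stays at 0.05 thereafter.

```python
def update_weight(n):
    """Weight a for session index n (n >= 1), clamped to [0.05, 0.2]."""
    return min(max(1.0 / n, 0.05), 0.2)

def update_reference(r_old, t, n):
    """Blend reference frames r_old with aligned utterance frames t."""
    a = update_weight(n)
    return [
        [(1.0 - a) * r + a * x for r, x in zip(ref_frame, utt_frame)]
        for ref_frame, utt_frame in zip(r_old, t)
    ]
```

For example, at session n = 10 the weight is 0.1, so each reference feature moves one tenth of the way toward the corresponding feature of the newly accepted utterance.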

A block diagram depicting verification of an utterance is illustrated in FIG. 4. The verification speech, submitted by the user requesting access, is subjected to A/D conversion, LPC analysis, and feature extraction in blocks 40-44. The A/D conversion, LPC analysis and feature extraction are identical to the processes described in connection with FIG. 2.

The parameters computed by the feature extraction circuit 44 are input to a dynamic time warping and compare circuit 46. Dynamic time warping (DTW) employs an optimum warping function for nonlinear time alignment of the two utterances (reference and verification) at equivalent points of time. The correlation between the two utterances is derived by integrating over time the Euclidean distances between the feature parameters representing the time aligned reference and verification utterances at each frame. The DTW and compare circuit 46 outputs a score representing the similarities between the two utterances. The score is compared to a predetermined threshold by decision logic 36, which determines whether the utterance is accepted or rejected.
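A textbook form of dynamic time warping is sketched below to illustrate the comparison; it is not asserted to be the warping function of circuit 46. The local distance is the sum of squared feature differences, the step pattern (match, insertion, deletion) and the absence of path constraints or length normalization are assumptions of this sketch, and the score is treated as a distance, so that a smaller score indicates a closer match.

```python
import math

def dtw_score(reference, verification):
    """Accumulated squared Euclidean distance along the best warping path."""
    n, m = len(reference), len(verification)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between reference frame i and verification frame j.
            d = sum((a - b) ** 2
                    for a, b in zip(reference[i - 1], verification[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j],      # reference frame repeated
                                 cost[i][j - 1],      # verification frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def accept(score, threshold):
    """Decision logic: accept when the accumulated distance is small enough."""
    return score <= threshold
```

An utterance that differs from the reference only by a lengthened frame aligns with zero accumulated distance, which is the property that makes the comparison tolerant of speaking rate.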

In order to compute the linear transformation matrix used in the feature extraction circuits 44 and 30, a speech database is collected over a group of users. If, for example, the speech database is to be used in connection with a telephone network, the database speech will be collected over the long distance network to provide for the variations in handset microphones and signal distortions due to the telephone channel. Speech is collected from the users over a number of sessions. During each session, the users repeat an authorization phrase, such as "1650 Atlanta, Georgia" or a phone number such as "765-4321".

FIG. 5a illustrates the speech data for a single user. The database utterances are digitized and subjected to LPC analysis as discussed in connection with FIG. 2. Consequently, each utterance 48 is broken into a number of 40 msec frames 50. Each frame is represented by 32 parameters, as previously discussed herein. Each speaker provides a predetermined number of utterances 48. For example, in FIG. 5a, each speaker provides 40 utterances. An initial linear transformation matrix [L_d] or "in-class" covariance matrix [L] is derived from a principal component analysis performed on a pooled covariance matrix computed over all true speakers. To compute the initial linear transformation matrix [L_d], covariance matrices are computed for each speaker over the 40 (or other predetermined number) time aligned database utterances 48. The covariance matrices derived for each speaker in the database are pooled together and diagonalized. The initial linear transformation matrix is made up of the eigenvectors of the pooled covariance matrix. The resulting diagonalized initial linear transform matrix will have dimensions of 32×32; however, the resulting matrix comprises uncorrelated features ranked in decreasing order of statistical variance. Therefore, the least significant features may be discarded. The resulting initial linear transformation (after discarding the least significant features) accounts for approximately 95% of the total variance in the data.
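The pooling and principal component analysis described above may be sketched as follows, assuming NumPy. The function names are hypothetical, the pooling is assumed to be an equal-weight average of the per-speaker covariance matrices, and the time alignment of the utterances is assumed to have been performed beforehand; none of these details is fixed by the specification.

```python
import numpy as np

def pooled_covariance(frames_by_speaker):
    """Average the per-speaker covariance matrices (one row per frame)."""
    covs = [np.cov(np.asarray(frames, dtype=float), rowvar=False)
            for frames in frames_by_speaker]
    return np.mean(covs, axis=0)

def principal_components(cov, keep_variance=0.95):
    """Eigenvectors of cov, ranked by decreasing variance and truncated."""
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # re-rank in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    # Smallest number of components whose variance reaches the target.
    k = min(int(np.searchsorted(cumulative, keep_variance)) + 1, len(eigvals))
    return eigvecs[:, :k], eigvals[:k]
```

Projecting the parameters onto the returned eigenvectors yields uncorrelated features, and the truncation point `k` plays the role of discarding the least significant features.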

In an important aspect of the present invention, the initial linear transformation matrix is adjusted to maximize the separability between true speakers and impostors in a given database. Speaker separability is a more desirable goal than creating a set of statistically uncorrelated features, since the uncorrelated features may not be good discriminant features.

An inter-class or "confusion" covariance matrix is computed over all time-aligned utterances for all successful impostors of a given true speaker. For example, if the database shows that the voice data supplied by 120 impostors (anyone other than the true speaker) will be accepted by the verification system as coming from the true speaker, a covariance matrix is computed for these utterances. The covariance matrices computed for impostors of each true speaker are pooled over all true speakers. The covariance matrix corresponding to the pooled impostor data is known as the "inter-class" or "confusion" covariance matrix [C].

To compute the final linear transformation matrix [LT], the initial linear transformation covariance matrix [L] is diagonalized, resulting in a matrix [L_d]. The matrix [L_d] is multiplied by the confusion matrix [C] and is subsequently diagonalized. The resulting matrix is the linear transformation matrix [LT]. The block diagram showing computation of the linear transformation matrix is illustrated in FIG. 5b in blocks 52-58.
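One possible reading of the two-stage diagonalization is sketched below, assuming NumPy. In this reading the diagonalization of [L] is taken to include scaling by the inverse square roots of the eigenvalues (whitening), the confusion matrix [C] is expressed in the whitened coordinates, and the eigenvectors of that result supply the final rotation. The whitening step, the descending ranking, the requirement that [L] be positive definite, and the function name are assumptions of this sketch; the specification does not spell out these details.

```python
import numpy as np

def discriminant_transform(in_class_cov, confusion_cov, n_features=14):
    """Columns jointly diagonalize both matrices; first n_features are kept."""
    # Stage 1: diagonalize the in-class covariance [L] and whiten it.
    l_vals, l_vecs = np.linalg.eigh(in_class_cov)   # assumes positive definite [L]
    whiten = l_vecs / np.sqrt(l_vals)               # scale each eigenvector column
    # Stage 2: express the confusion matrix [C] in whitened coordinates.
    c_rot = whiten.T @ confusion_cov @ whiten
    c_vals, c_vecs = np.linalg.eigh(c_rot)
    order = np.argsort(c_vals)[::-1]                # dominant dimensions first
    return (whiten @ c_vecs[:, order])[:, :n_features]
```

With this construction the transformed in-class covariance is the identity and the transformed confusion covariance is diagonal, so each retained feature has a directly comparable ratio of impostor spread to true-speaker spread.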

The transformation provided by the confusion matrix further rotates the speech feature vector to increase separability between true speakers and impostors. In addition to providing a higher impostor rejection rate, the transformation leads to a further reduction in the number of features used in the speech representation (dimensionality), since only the dominant dimensions need to be preserved. Whereas an eighteen-feature vector per frame is typically used for the principal spectral components, it has been found that a fourteen-feature vector may be used in connection with the present invention. The smaller feature vector reduces the noise inherent in the transformation.

Experimental results comparing impostor acceptance as a function of true speaker rejection are shown in FIG. 6. In FIG. 6, curve "A" illustrates impostor acceptance computed without use of the confusion matrix. Curve "B" illustrates the impostor acceptance using the confusion matrix to provide speaker discrimination. As can be seen, for a true speaker rejection of approximately two percent, the present invention reduces the impostor acceptance by approximately ten percent.
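For clarity, the two quantities plotted against each other in a comparison of this kind may be computed from verification scores as sketched below. The function name is hypothetical, the scores are assumed to be distances (accepted when at or below the threshold), and the numbers in the usage note are made-up toy values, not data from FIG. 6.

```python
def error_rates(true_scores, impostor_scores, threshold):
    """Return (true speaker rejection rate, impostor acceptance rate)."""
    # A true speaker is rejected when the distance exceeds the threshold.
    rejection = sum(s > threshold for s in true_scores) / len(true_scores)
    # An impostor is accepted when the distance is at or below the threshold.
    acceptance = sum(s <= threshold for s in impostor_scores) / len(impostor_scores)
    return rejection, acceptance
```

With toy true-speaker scores [1, 2, 3, 10], impostor scores [2.5, 8, 9, 12] and a threshold of 3, one of four true speakers is rejected and one of four impostors is accepted; sweeping the threshold traces out a curve of impostor acceptance against true speaker rejection.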

In addition to the dynamic time warping (time-aligned) method of performing the comparison of the reference and verification utterances, a Hidden Markov Model-based (HMM) comparison could be employed. An HMM comparison would provide a state-by-state comparison of the reference and verification utterances, each utterance being transformed as described hereinabove. It has been found that a word-by-word HMM comparison is preferable to a whole-phrase comparison, due to the inaccuracies caused mainly by pauses between words.

Although the present invention has been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

What is claimed is:
 1. A method of verifying the identity of an unknown person, whereby the unknown person's identity is determined to be either a true speaker or an impostor, comprising the steps of: receiving input speech from the unknown person; coding the input speech into a set of predetermined spectral and energy parameters; transforming the parameters based on a predetermined statistically maximized discrimination between true speakers and successful impostors; and comparing the transformed parameters with a stored reference model to verify the identity of the unknown person.
 2. The method of claim 1 wherein said step of transforming comprises the step of transforming the parameters with a linear transform matrix.
 3. The method of claim 2 wherein said step of transforming the parameters with a linear transform matrix comprises the steps of: forming a database of speech samples from a plurality of speakers; coding said speech samples into a plurality of parameters; creating an in-class covariance matrix based on the parameters of all true speakers in the database; creating an inter-class covariance matrix based on the parameters of successful impostors in the database; and creating the linear transform matrix based on said in-class covariance matrix and said inter-class covariance matrix.
 4. The method of claim 3 wherein said step of creating the linear transform matrix comprises the steps of: determining a transformation by diagonalizing the in-class covariance matrix; multiplying the inter-class covariance matrix by said transformation; and diagonalizing the matrix formed in said multiplying step.
 5. The method of claim 1 wherein said step of coding the speech comprises the step of performing linear predictive coding on said speech to generate spectral information.
 6. The method of claim 5 wherein said step of coding the speech further comprises the step of performing linear predictive coding on said speech to generate energy information.
 7. The method of claim 6 wherein said step of coding said speech further comprises the step of digitizing said speech prior to said steps of performing said linear predictive coding.
 8. A method of verifying the identity of an unknown person over a telephone network, whereby the unknown person's identity is determined to be either a true speaker or an impostor, comprising the steps of: receiving input speech from the unknown person over the telephone network; coding the input speech into a set of predetermined spectral and energy parameters; transforming the parameters based on a predetermined statistically maximized discrimination between true speakers and successful impostors; and comparing the transformed parameters with a stored reference model to verify the identity of the unknown person.
 9. The method of claim 8 wherein said step of transforming comprises the step of transforming the parameters with a linear transform matrix.
 10. The method of claim 9 wherein said step of transforming the parameters with a linear transform matrix comprises the steps of: forming a database of speech samples from a plurality of speakers over the telephone network; coding said speech samples into a plurality of parameters; creating an in-class covariance matrix based on the parameters of all true speakers in the database; creating an inter-class covariance matrix based on the parameters of successful impostors in the database; and creating the linear transform matrix based on said in-class covariance matrix and said inter-class covariance matrix.
 11. The method of claim 10 wherein said step of creating the linear transform matrix comprises the steps of: determining a transformation by diagonalizing the in-class covariance matrix; multiplying the inter-class covariance matrix by said transformation; and diagonalizing the matrix formed in said multiplying step.
 12. The method of claim 8 wherein said step of coding the speech comprises the step of performing linear predictive coding on said speech to generate spectral information.
 13. The method of claim 12 wherein said step of coding the speech further comprises the step of performing linear predictive coding on said speech to generate energy information.
 14. The method of claim 13 wherein said step of coding said speech further comprises the step of digitizing said speech prior to said steps of performing said linear predictive coding.
 15. Apparatus for verifying the identity of an unknown person, whereby the unknown person's identity is determined to be either a true speaker or an impostor, comprising: circuitry for receiving input speech from the unknown person; circuitry for coding the input speech into a set of predetermined spectral and energy parameters; circuitry for transforming the parameters based on a predetermined statistically maximized discrimination between true speakers and successful impostors; and circuitry for comparing the transformed parameters with a stored reference model to verify the identity of the unknown person.
 16. The apparatus of claim 15 wherein said circuitry for transforming comprises circuitry for transforming the parameters with a linear transform matrix.
 17. The apparatus of claim 16 wherein said circuitry for transforming the parameters with a linear transform matrix comprises: circuitry for forming a database of speech samples from a plurality of speakers; circuitry for coding said speech samples into a plurality of parameters; circuitry for creating an in-class covariance matrix based on the parameters of all true speakers in the database; circuitry for creating an inter-class covariance matrix based on the parameters of successful impostors in the database; and circuitry for creating the linear transform matrix based on said in-class covariance matrix and said inter-class covariance matrix.
 18. The apparatus of claim 17 wherein said circuitry for creating the linear transform matrix comprises: circuitry for determining a transformation by diagonalizing the in-class covariance matrix; circuitry for multiplying the inter-class covariance matrix by said transformation; and circuitry for diagonalizing the product of the inter-class covariance matrix multiplied by the diagonalized in-class covariance matrix.
 19. The apparatus of claim 15 wherein said coding circuitry comprises circuitry for performing linear predictive coding on said speech to generate spectral information.
 20. The apparatus of claim 19 wherein said coding circuitry further comprises circuitry for performing linear predictive coding on said speech to generate energy information.
 21. The apparatus of claim 20 wherein said coding circuitry further comprises circuitry for digitizing said speech prior to performing said linear predictive coding.